Abstract

Prior distributions for unknown data distributions play an important role
in nonparametric Bayesian statistics. A commonly used prior distribution for
an unknown data distribution is the Dirichlet process, which induces a random
partition on the observations from the unknown data distribution. We
investigate the prediction rule that underlies the Dirichlet process prior and
the implicit "rich-get-richer" characteristics of random partitions generated
by this process. To provide more flexibility for the modeling of random
partitions, we present two alternative prior distributions for random partitions:
the Pitman-Yor process and a uniform process. We present several asymptotic
results for partitions under each process as well as a simulation-based
evaluation of partition properties in small samples. We also discuss the
exchangeability of partitions under each prediction rule. We give special focus
to the uniform process, which does not share the "rich-get-richer" property
of the Dirichlet process and would therefore be advantageous in applications
where that implicit property is not reasonable.
1 Introduction

Fully parametric Bayesian models are a powerful and popular approach to many
difficult statistical problems. However, in many applied situations, practitioners
are uncomfortable with the assumption of a parametric distribution for each level
of the model, and a non-parametric or semi-parametric approach is instead
preferred. For a recent review of Bayesian non-parametric modeling, see Müller and
Quintana (2004). A common characteristic of Bayesian non-parametric or
semi-parametric models is that we have a set of observations y from an unknown
probability distribution P. In multi-level models, these observations can also
represent latent variables or unknown parameters. A Bayesian model also requires
the use of a prior distribution for the unknown probability distribution P.
Dirichlet processes are a class of prior distributions that have become ubiquitous
in Bayesian non-parametric modeling. Let $\mu$ be a finite measure on a sample
space $S$. A probability measure $P$ follows a Dirichlet process $DP(\mu)$ with
characteristic measure $\mu$ if $(P(A_1), \ldots, P(A_n))$ follows a Dirichlet
distribution with parameter $(\mu(A_1), \ldots, \mu(A_n))$ for every finite
partition $A_1, \ldots, A_n$ of $S$ into measurable sets $A_i$. Let
$\theta = \mu(S) > 0$ and define

$$\nu(\cdot) = \mu(\cdot)/\theta.$$

Ferguson (1973) proved the existence of Dirichlet processes, first using the
Kolmogorov extension theorem and then giving a constructive proof. Ferguson (1973)
also demonstrated the following key theorem about Dirichlet processes:

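Theorem 1.1. Suppose that $X_1, \ldots, X_n$ is a sample from $P$ and that $P \sim DP(\mu)$. Then

$$P \mid X_1, \ldots, X_n \sim DP\Big(\mu + \sum_{i=1}^{n} \delta_{X_i}\Big), \tag{1}$$

where $\delta_x$ is a measure that gives mass 1 to the point $x$.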



A crucial consequence of this theorem is that the posterior distribution of P is
inherently discretized by the point masses at each unique $X_i$. In many
applications where density estimation is the desired result, this discreteness is
problematic and so convolution with smooth kernels is typically employed. Escobar
and West (1995) discuss the use of Dirichlet mixtures of normals for density
estimation. However, in many other applications, this discreteness can be utilized
to cluster data or latent variables, effectively using the Dirichlet process as a
prior distribution for partitions of random variables. In this article, we will
focus entirely on the use of this discreteness property and its consequences in
problems concerned with the partitioning of random variables.

The Dirichlet process has become a popular tool for clustering within hierarchical
Bayesian models. Some recent examples in the literature include Liu (1996),
Green and Richardson (2001), Medvedovic and Sivaganesan (2002), and Jensen and Liu
(2007). However, in most applications, very little attention is given to the implicit
assumptions about the structure of partitions generated by a Dirichlet process
prior distribution. As we explore in Section 2 below, a fundamental characteristic
of partitions generated by the Dirichlet process is a "rich-get-richer" property
that leads to a priori partitions consisting of a small number of large clusters. This
may be an undesirable property in many applications, and for these situations, a
practitioner is confronted by the following questions:



1. Are there alternatives to the Dirichlet process without this "rich-get-richer"
property?
2. What are the corresponding properties of these alternative prior processes?



In this work, we address these questions by exploring two alternative prior
processes, the Pitman-Yor process and a uniform process, whose characteristics
differ substantively from those of the Dirichlet process. We focus primarily on the
uniform process as a particularly intriguing alternative for the modeling of
random partitions, since it generates a dramatically different set of clustering
characteristics compared to the Dirichlet process. We compare the uniform process to
the Dirichlet and Pitman-Yor processes both in terms of asymptotic characteristics
(Section 3) and characteristics in reasonably sized samples (Section 4). We
briefly discuss some exchangeability issues in Section 5 and conclude with a brief
discussion.



2 Prediction Rules for Partition Priors 

We are interested in the partitioning of n variables $X_1, \ldots, X_n$. It is often
convenient to describe this partitioning process by a "prediction rule": the
sequence of generative conditional probabilities implied by a particular prior
distribution. In this framework, we observe random variables $X_1, \ldots, X_n$ one
at a time, and our partition is constructed sequentially. From Theorem 1.1, if we
use a Dirichlet process prior for P, then the conditional distribution of a new
observation $X_{n+1}$ is a mixture of point masses at the previous observations
$X_1, \ldots, X_n$ and the underlying measure $\mu$. If we define $X_i$ and $X_j$ to
be in the same cluster if $X_i = X_j$, then we see that this prediction rule
formulation sequentially constructs a partition, since $X_{n+1}$ joins an existing
cluster if $X_{n+1} = X_i$ for some $i \le n$, or alternatively, $X_{n+1}$ is drawn
from the measure $\nu(\cdot) = \mu(\cdot)/\theta$, which creates a new cluster
consisting only of $X_{n+1}$. The parameter $\theta$ plays the role of a prior weight
for the formation of a new cluster.

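We can also write this prediction rule in terms of the current clusters in the
partition. Let $X^*_1, \ldots, X^*_K$ be the K distinct values (i.e. K clusters)
observed in the set of variables $X_1, \ldots, X_n$, and define
$N = (N_1, \ldots, N_K)$ such that $N_k = \sum_{i=1}^{n} I(X_i = X^*_k)$, i.e.
$N_k$ is the size of cluster k. Then we have

Corollary 2.1. Suppose that $X_1, \ldots, X_{n+1} \sim P$ and $P \sim DP(\mu)$. Then

$$\begin{aligned}
\Pr(X_{n+1} = X^*_k \mid X_1, \ldots, X_n) &= \frac{N_k}{n + \theta}, \qquad 1 \le k \le K, \\
\Pr(X_{n+1} \notin \{X_1, \ldots, X_n\} \mid X_1, \ldots, X_n) &= \frac{\theta}{n + \theta}.
\end{aligned} \tag{2}$$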
This formulation is evident in the popular Chinese restaurant construction of the
Dirichlet process. Imagine a Chinese restaurant with an infinite number of
infinitely large tables. The restaurant is initially empty and the first customer
enters the restaurant and sits at a table by himself. Customers continue to enter
the restaurant, one at a time, and the probability that the $(n+1)$-th customer to
enter the restaurant sits next to the $i$-th customer ($i \le n$) at their table is
$1/(n + \theta)$. The probability that the $(n+1)$-th customer sits at a previously
unoccupied table is $\theta/(n + \theta)$. With this construction, we see that the
probability of the $(n+1)$-th customer joining a particular table k is proportional
to the number of customers $N_k$ already sitting at table k, which leads us to (2).
The most obvious characteristic of this prediction rule is the "rich get richer"
property: the probability of joining a cluster is proportional to the size of that
cluster, which means that new observations have a strong preference towards already
large clusters. This preferential attachment property has been observed in a wide
variety of natural settings, such as the study of scale-free networks (Barabási and
Albert, 1999). However, this preferential attachment property may be
disadvantageous in other applications, a fact that is not commonly acknowledged by
practitioners using the Dirichlet process.
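
The seating scheme described above can be sketched in a few lines of Python. This is only an illustration under our own naming: the function `chinese_restaurant` and its `seed` argument are hypothetical helpers, not part of the original construction.

```python
import random


def chinese_restaurant(n, theta, seed=0):
    """Seat n customers; return table sizes in order of table creation."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers currently at table k
    for m in range(n):  # m customers are already seated
        # A new table is opened with probability theta / (m + theta);
        # for m = 0 this probability is 1, so the first customer sits alone.
        if rng.random() < theta / (m + theta):
            tables.append(1)
        else:
            # Sitting next to a uniformly chosen seated customer means that
            # table k is chosen with probability N_k / m.
            k = rng.choices(range(len(tables)), weights=tables)[0]
            tables[k] += 1
    return tables
```

Calling `chinese_restaurant(1000, 1.0)` returns one simulated vector of cluster sizes; repeated calls with different seeds give independent draws from the partition prior.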

This predictive rule framework allows us to consider a generalization of the
Dirichlet process of the following form:

$$\begin{aligned}
\Pr(X_{n+1} = X^*_k \mid X_1, \ldots, X_n) &\propto f(N_k), \qquad 1 \le k \le K, \\
\Pr(X_{n+1} \notin \{X_1, \ldots, X_n\} \mid X_1, \ldots, X_n) &\propto g(\theta).
\end{aligned} \tag{3}$$

That $\Pr(X_{n+1} = X^*_k \mid X_1, \ldots, X_n)$ can only depend on the sample through
$N_k$ has been referred to as "Johnson's sufficientness postulate" by Zabell (1992).
In the remainder of this section, we focus on two particular sets of choices for
functions $f$ and $g$ that give us alternatives to the Dirichlet process.
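
A generic sampler for rules of the form (3) might look as follows. This is a sketch under stated assumptions: `sample_partition` is a hypothetical helper, and we let `g` receive the current number of clusters K so that weights depending on K (as in the Pitman-Yor rule below) can also be expressed.

```python
import random


def sample_partition(n, f, g, seed=0):
    """Draw cluster sizes for n observations from a rule of the form (3).

    f(N_k) is the unnormalised weight for joining a cluster of size N_k;
    g(K) is the unnormalised weight for opening a new cluster when K
    clusters exist (assumption: g may depend on K as well as on theta).
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        weights = [f(nk) for nk in sizes] + [g(len(sizes))]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # the last weight corresponds to a new cluster
        else:
            sizes[k] += 1
    return sizes


theta, a = 1.0, 0.5
dp = sample_partition(300, lambda nk: nk, lambda K: theta)
un = sample_partition(300, lambda nk: 1.0, lambda K: theta)
py = sample_partition(300, lambda nk: nk - a, lambda K: theta + K * a)
```

Normalisation is handled implicitly by `random.choices`, so only the unnormalised weights $f$ and $g$ need to be supplied.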



2.1 Uniform prediction rule 

The prediction rule (2) suggests that the use of a Dirichlet process prior will lead
to partitions that are dominated by a few large clusters, since larger clusters will
tend to attract new observations in the sequential creation of a partition. We might
prefer a prior distribution over partitions that more uniformly spreads observations
between clusters by letting $f_{UN}(N_k) = 1$ and $g_{UN}(\theta) = \theta$. The use
of these functions gives us the following uniform prediction rule:

$$\begin{aligned}
\Pr(X_{n+1} = X^*_k \mid X, UN) &= \frac{1}{K + \theta}, \qquad 1 \le k \le K, \\
\Pr(X_{n+1} \notin \{X_1, \ldots, X_n\} \mid X, UN) &= \frac{\theta}{K + \theta}.
\end{aligned} \tag{4}$$

Here, conditional on joining an existing cluster, the $(n+1)$-th observation chooses
uniformly among the K existing clusters, regardless of the current size of each
cluster. This prior over partitions was used in Qin et al. (2003) and Jensen and Liu
(2007), but the theoretical properties of this prior distribution have been
relatively unexplored. This exploration is a major focus of this paper.
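
To make the contrast with rule (2) concrete, the following sketch evaluates both prediction rules for one very unbalanced partition. The function names are ours and purely illustrative.

```python
def dp_predictive(sizes, theta):
    """Rule (2): join cluster k w.p. N_k/(n+theta); new w.p. theta/(n+theta)."""
    n = sum(sizes)
    return [nk / (n + theta) for nk in sizes], theta / (n + theta)


def uniform_predictive(sizes, theta):
    """Rule (4): join each cluster w.p. 1/(K+theta); new w.p. theta/(K+theta)."""
    K = len(sizes)
    return [1.0 / (K + theta)] * K, theta / (K + theta)


sizes = [98, 1, 1]  # one dominant cluster and two singletons
dp_join, dp_new = dp_predictive(sizes, theta=1.0)
un_join, un_new = uniform_predictive(sizes, theta=1.0)
```

Under the Dirichlet process rule the dominant cluster receives weight 98/101, whereas under the uniform rule each of the three clusters and the new-cluster option all receive weight 1/4.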



2.2 Pitman-Yor prediction rule

We also consider a final prediction rule that serves as a compromise between the
prediction rules (2) and (4) in terms of the preference towards sequential addition
of new observations into already large clusters. Consider the functions
$f_{PY}(N_k) = N_k - a$ and $g_{PY}(\theta) = \theta + K \cdot a$, in which case we
have the Pitman-Yor prediction rule:

$$\begin{aligned}
\Pr(X_{n+1} = X^*_k \mid X, PY) &= \frac{N_k - a}{n + \theta}, \qquad 1 \le k \le K, \\
\Pr(X_{n+1} \notin \{X_1, \ldots, X_n\} \mid X, PY) &= \frac{\theta + K \cdot a}{n + \theta}.
\end{aligned} \tag{5}$$

We see a "rich-get-richer" property similar to the Dirichlet process, but with an
additional parameter $a$ ($0 < a < 1$) which serves to reduce the probability of
adding to an existing cluster. This compromise prediction rule arose from a process
studied by Pitman and Yor (1997). Although this process has not received much
attention in the statistics literature, it has been used in clustering applications
in natural language processing (e.g. Teh (2006)).
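
As a small worked check of rule (5), the sketch below computes the Pitman-Yor predictive probabilities for a given vector of cluster sizes. The function name is hypothetical; we also allow $a = 0$ in the code, in which case the rule reduces to the Dirichlet process rule (2).

```python
def pitman_yor_predictive(sizes, theta, a):
    """Rule (5): join cluster k w.p. (N_k - a)/(n + theta);
    open a new cluster w.p. (theta + K*a)/(n + theta)."""
    if not 0 <= a < 1:
        raise ValueError("a must satisfy 0 <= a < 1")
    n, K = sum(sizes), len(sizes)
    join = [(nk - a) / (n + theta) for nk in sizes]
    new = (theta + K * a) / (n + theta)
    return join, new


join, new = pitman_yor_predictive([5, 3, 2], theta=1.0, a=0.5)
```

The discount $a$ removed from each of the K existing clusters is exactly the mass $K \cdot a$ added to the new-cluster weight, so the probabilities always sum to one.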



In the remainder of this paper, we will compare these two alternative processes
to the popular Dirichlet process in terms of several characteristics of interest of
their resulting partitions. Specifically, we will focus on the number of clusters
$K_n$ as well as the distribution of cluster sizes that arise from the partitioning
of n observations. When analyzing the distribution of cluster sizes for n
observations, we do not focus on the raw sizes of each cluster
$N_n = (N_1, \ldots, N_{K_n})$ directly, but rather concentrate on summary variables
$C_n = (C_{1,n}, C_{2,n}, \ldots, C_{n,n})$ where $C_{j,n}$ is the number of clusters
of size j in the partition of n observations. In our experience, the number of
clusters $K_n$ and the histogram of cluster sizes $C_n$ are the primary statistics of
interest when summarizing partitions.
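
The summary $C_n$ is straightforward to compute from the raw cluster sizes. The helper below is an illustrative sketch (the name `size_histogram` is ours).

```python
from collections import Counter


def size_histogram(sizes):
    """Return [C_1, ..., C_n], where C_j is the number of clusters of size j."""
    n = sum(sizes)
    counts = Counter(sizes)
    return [counts.get(j, 0) for j in range(1, n + 1)]


C = size_histogram([3, 1, 1])  # two singletons and one cluster of size 3
```

By construction the entries of $C_n$ sum to $K_n$, and the weighted sum $\sum_j j \cdot C_{j,n}$ equals n.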



3 Asymptotic Behaviour of Prior Prediction Rules 

In this section, we compare the prior distributions implied by (2), (4) and (5) in
terms of the asymptotic behaviour of our partition characteristics, the number of
clusters $K_n$ and the histogram of cluster sizes $C_n$. We use the conditional
notation $(\cdot \mid DP)$, $(\cdot \mid PY)$, and $(\cdot \mid UN)$ to indicate that
the random variables $X_1, X_2, \ldots$ follow the Dirichlet process, Pitman-Yor, and
uniform prediction rule, respectively, though this notation is sometimes suppressed
if the prediction rule under consideration is clear from the context. We first
review the asymptotic expectation for $K_n$ and $C_{j,n}$, which have been studied
extensively for the Dirichlet process prediction rule. We then review some previous
results for the Pitman-Yor process prediction rule which are less well known in the
statistical community. Finally, we present possibly novel asymptotic results for the
Uniform prediction rule, which has not been studied extensively in the previous
literature.



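3.1 $E(K_n)$ and $E(C_{j,n})$ for the Dirichlet process prediction rule

Assume that $X_1, \ldots, X_n$ are generated using the Dirichlet process prediction
rule. Observe that $K_n = \sum_{i=1}^{n} I(X_i \notin \{X_1, \ldots, X_{i-1}\})$, so that

$$E(K_n \mid DP) = \sum_{i=1}^{n} \Pr(X_i \notin \{X_1, \ldots, X_{i-1}\}) = \sum_{i=1}^{n} \frac{\theta}{i - 1 + \theta},$$

and as $n \to \infty$,

$$E(K_n \mid DP) = \sum_{i=1}^{n} \frac{\theta}{i - 1 + \theta} \approx \theta \log n. \tag{6}$$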
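
The finite sum in (6) is cheap to evaluate exactly, which makes it easy to see how slowly the logarithmic approximation takes hold. The sketch below uses hypothetical function names of our own.

```python
import math


def expected_clusters_dp(n, theta):
    """Exact E(K_n | DP) = sum_{i=1}^n theta / (i - 1 + theta)."""
    return sum(theta / (i - 1 + theta) for i in range(1, n + 1))


def log_ratio(n, theta):
    """Ratio of the exact expectation to the approximation theta * log(n)."""
    return expected_clusters_dp(n, theta) / (theta * math.log(n))
```

For $\theta = 1$ the exact expectation is the harmonic number $H_n$, so the ratio to $\log n$ exceeds one and decreases towards one only at a logarithmic rate.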



Arratia et al. (2003) demonstrated the following result for the asymptotic
expectation of the cluster sizes $C_n$ under the Dirichlet process prediction rule.
For arbitrary $\theta > 0$,

$$\lim_{n \to \infty} E(C_{j,n} \mid DP) = \frac{\theta}{j}. \tag{7}$$

This simple result implies that, regardless of the value of $\theta$, the expected
number of clusters of a given size j is inversely proportional to that size j. In
fact, for the Dirichlet process, we have a more general characterization of the
asymptotic behaviour of cluster sizes $C_n$. It is also shown in Arratia et al.
(2003) that, as $n \to \infty$, $(C_{1,n}, C_{2,n}, \ldots) \mid DP$ converges in
distribution to $(Z_1, Z_2, \ldots)$ where the $Z_j$ are independent and

$$Z_j \sim \text{Poisson}(\theta/j). \tag{8}$$

It is also worth noting that since $K_n = \sum_{j=1}^{n} C_{j,n}$, one can use (7) to
also obtain the asymptotic result for $E(K_n \mid DP)$ given in (6).
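3.2 $E(K_n)$ and $E(C_{j,n})$ for the Pitman-Yor prediction rule

Suppose $0 < a < 1$ and $X_1, \ldots, X_n$ are generated using the Pitman-Yor
prediction rule. Again, using the observation that
$K_n = \sum_{i=1}^{n} I(X_i \notin \{X_1, \ldots, X_{i-1}\})$, we have

$$\begin{aligned}
E(K_n) &= E[E(K_n \mid K_{n-1})] = E[K_{n-1} + P(X_n \notin \{X_1, \ldots, X_{n-1}\} \mid K_{n-1})] \\
&= E(K_{n-1}) + E\left[\frac{\theta + a K_{n-1}}{\theta + n - 1}\right] \\
&= \frac{a + \theta + n - 1}{\theta + n - 1} E(K_{n-1}) + \frac{\theta}{\theta + n - 1}.
\end{aligned} \tag{9}$$

With the initial condition $E(K_1) = 1$, this recursive relationship is solved by

$$E(K_n \mid PY) = \frac{\Gamma(1 + \theta)\,\Gamma(a + \theta + n)}{a\,\Gamma(a + \theta)\,\Gamma(\theta + n)} - \frac{\theta}{a},$$

and so, as $n \to \infty$,

$$E(K_n \mid PY) \approx K(a, \theta) \cdot n^{a} \tag{10}$$

where $K(a, \theta) = \Gamma(1 + \theta)/(a\,\Gamma(a + \theta))$. This result is also
given in Pitman (2002), along with the following result for the distribution of
cluster sizes $C_n$,

$$\frac{C_{j,n}}{K_n} \xrightarrow{a.s.} P_{a,j} = \frac{a \prod_{i=1}^{j-1}(i - a)}{j!} \qquad \text{for every } j = 1, 2, \ldots \tag{11}$$

Combining (11) together with (10) suggests that, as $n \to \infty$,

$$E(C_{j,n} \mid PY) \approx g(j) := \frac{\Gamma(1 + \theta) \prod_{i=1}^{j-1}(i - a)}{\Gamma(a + \theta)\, j!}\, n^{a}, \qquad j = 1, 2, \ldots \tag{12}$$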
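
The recursion for $E(K_n)$ under the Pitman-Yor rule and its Gamma-function solution can be checked against each other numerically. The following is a minimal sketch with function names of our own choosing; the closed form is evaluated on the log scale with `math.lgamma` to avoid overflow for large n.

```python
import math


def expected_clusters_py_recursive(n, theta, a):
    """Iterate E(K_m) = E(K_{m-1}) (a+theta+m-1)/(theta+m-1) + theta/(theta+m-1)."""
    e = 1.0  # initial condition E(K_1) = 1
    for m in range(2, n + 1):
        e = e * (a + theta + m - 1) / (theta + m - 1) + theta / (theta + m - 1)
    return e


def expected_clusters_py_closed(n, theta, a):
    """Gamma(1+theta) Gamma(a+theta+n) / (a Gamma(a+theta) Gamma(theta+n)) - theta/a."""
    log_ratio = (math.lgamma(1 + theta) + math.lgamma(a + theta + n)
                 - math.lgamma(a + theta) - math.lgamma(theta + n))
    return math.exp(log_ratio) / a - theta / a


def leading_term(n, theta, a):
    """Asymptotic approximation K(a, theta) * n**a from (10)."""
    return math.exp(math.lgamma(1 + theta) - math.lgamma(a + theta)) / a * n ** a
```

Comparing `expected_clusters_py_closed` with `leading_term` for increasing n gives a quick impression of how accurate the $n^a$ approximation is at a given sample size.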



3.3 $E(K_n)$ and $E(C_{j,n})$ for the Uniform prediction rule

The previous literature is far more sparse with regard to the Uniform prediction
rule. We provide the following potentially novel result for the expected number of
clusters under the Uniform prediction rule.

Theorem 3.1.

$$\lim_{n \to \infty} \frac{E(K_n \mid UN)}{\sqrt{n}} = \sqrt{2\theta},$$

which is proven in Appendix A. Thus, we see that as $n \to \infty$,

$$E(K_n \mid UN) \approx K(\theta) \cdot n^{1/2} \tag{13}$$

where $K(\theta) = \sqrt{2\theta}$. We also suggest the following conjecture for the
distribution of cluster sizes $C_n$, based on our simulation results from the
following Section 4.

Conjecture 3.2.

$$E(C_{j,n} \mid UN) \to \theta \quad \text{as } n \to \infty \text{ for every } j = 1, 2, \ldots$$

This conjecture fits nicely with the underlying intuition of the uniform prediction
rule, that new observations are equally likely to join any of the previously existing
clusters, regardless of size.
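
The square-root growth of $K_n$ under the uniform rule is easy to probe by simulation, because under rule (4) the number of clusters evolves as a Markov chain: with k clusters present, the next observation opens a new cluster with probability $\theta/(k + \theta)$. The sketch below relies on that observation; the function names, number of replicates and seed are illustrative choices of ours.

```python
import math
import random


def simulate_k_uniform(n, theta, rng):
    """Simulate K_n under the uniform rule by tracking only the cluster count."""
    k = 1  # the first observation always forms the first cluster
    for _ in range(n - 1):
        if rng.random() < theta / (k + theta):
            k += 1
    return k


def mean_k_uniform(n, theta, reps=200, seed=0):
    """Monte Carlo estimate of E(K_n | UN)."""
    rng = random.Random(seed)
    return sum(simulate_k_uniform(n, theta, rng) for _ in range(reps)) / reps


def sqrt_rate(n, theta):
    """Asymptotic approximation sqrt(2 * theta * n) from (13)."""
    return math.sqrt(2.0 * theta * n)
```

The ratio `mean_k_uniform(n, theta) / sqrt_rate(n, theta)` should approach one as n grows, up to Monte Carlo error.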

To summarize our asymptotic results, the expected number of clusters grows
logarithmically with the sample size under a Dirichlet process prediction rule,
whereas the uniform prediction rule leads the expected number of clusters $K_n$ to
grow with the square root of the sample size. Interestingly, the Pitman-Yor
prediction rule implies that the expected number of clusters grows at a rate of
$n^a$, which means that depending on the value of the additional parameter $a$, the
Pitman-Yor prediction rule can lead to a slower or faster growth rate for $K_n$ than
the uniform prediction rule. For $a = 0.5$, the expected number of clusters grows at
the same rate for the Pitman-Yor and uniform prediction rules, though the
distribution of cluster sizes $C_n$ for these two rules is clearly quite different,
as seen in the results above as well as our simulation results in Section 4 below. In
fact, the distribution of cluster sizes $C_n$ for the Pitman-Yor prediction rule
shows similar characteristics to $C_n$ from the Dirichlet process prediction rule,
despite the closer similarity between the Pitman-Yor and uniform prediction rules
with respect to the number of clusters $K_n$. The general conclusion is that the
Pitman-Yor process can be configured to look similar to a uniform process in terms of
the expected number of clusters, but not in terms of the uniform distribution of
cluster sizes, which is a unique aspect of the uniform process.

4 Simulation Comparisons in Finite Samples 

The asymptotic results presented in the previous section are not necessarily
applicable to finite samples, where the distribution of cluster sizes is more
sensitive to the finite sample constraint

$$\sum_{j=1}^{n} j \cdot C_{j,n} = n. \tag{14}$$

We can appraise the consequences of these three prediction rules in finite samples
by simulation. For various values of n and $\theta$ and for each of the Dirichlet
process, Pitman-Yor (with $a = 0.5$) and uniform prediction rules, we simulated
1000 independent partitions. Each of these partitions gives us a simulated number
of clusters $K_n$ and distribution of cluster sizes
$C_n = (C_{1,n}, C_{2,n}, \ldots, C_{n,n})$ under our three prediction rules.
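
A simulation of this kind can be sketched as follows for the uniform prediction rule. This is an illustrative reimplementation rather than the code used for the results reported here: the function names, the default seed and the choice to store averages in a list indexed by cluster size are all our own assumptions.

```python
import random


def simulate_uniform_sizes(n, theta, rng):
    """Draw one partition of n observations under the uniform rule (4)."""
    sizes = []
    for _ in range(n):
        K = len(sizes)
        # New cluster w.p. theta/(K+theta); this is 1 when K = 0.
        if rng.random() < theta / (K + theta):
            sizes.append(1)
        else:
            sizes[rng.randrange(K)] += 1  # uniform choice among K clusters
    return sizes


def average_histogram(n, theta, reps=1000, seed=0):
    """Average C_{j,n} over reps partitions; entry j holds the average for size j."""
    rng = random.Random(seed)
    avg = [0.0] * (n + 1)  # index 0 is unused
    for _ in range(reps):
        for size in simulate_uniform_sizes(n, theta, rng):
            avg[size] += 1.0 / reps
    return avg
```

Because every simulated partition satisfies the constraint $\sum_j j \cdot C_{j,n} = n$, so do the averages returned by `average_histogram`.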
4.1 Comparison of $K_n$ between prediction rules

In Figure 1 we examine the relationship between the number of observations n and
the average number of clusters $K_n$ (averaged over the 1000 simulated partitions).
We see that the rate of growth of $K_n$ is the same for the uniform and the
Pitman-Yor ($a = 0.5$) prediction rules, which agrees with the equality suggested by
(10) and (13) when $a = 0.5$. Also, as postulated in Section 3, we see that depending
on the value of $a$, the Pitman-Yor prediction rule can show either slower
($a = 0.25$) or faster ($a = 0.75$) rates of growth of $K_n$ compared to the uniform
prediction rule. The rate of growth of $K_n$ for the Dirichlet process prediction
rule is slower than any of the other processes, as suggested by the logarithmic rate
in (6).

Figure 1: Expected number of clusters $K_n$ as a function of sample size n for
different values of $\theta$. Both axes are plotted on the log scale.
[Panels: $\theta = 1$, $\theta = 10$, $\theta = 100$.]

4.2 Distribution of Cluster Sizes under DP prediction rule 

Figure 2 is a plot of $\bar{C}_{j,n}$, the number of clusters of size j, as a
function of j. $\bar{C}_{j,n}$ was calculated as the average over the 1000 simulated
independent partitions of $C_{j,n}$ under the Dirichlet process prediction rule.
Data are plotted on a log-log scale and the line in each plot from the upper left
corner to the lower right corner is the relationship $f(j) = \theta/j$. By equation
(7), as $n \to \infty$, the relationship between the points $(j, \bar{C}_{j,n})$
should approach this line, which we do observe when comparing the top row
($n = 1000$) to the middle row ($n = 10000$) to the bottom row ($n = 100000$) in
Figure 2. However, it is interesting to observe the substantial divergence from this
relationship due to the finite sample size constraint, especially in the plots with
increasing values of $\theta$ ($\theta = 10$ and $\theta = 100$).



Figure 2: Dirichlet process prediction rule: $\bar{C}_{j,n}$ as a function of j for
different values of $\theta$ and n. Both axes are plotted on the log scale. The red
line indicates the line $f(j) = \theta/j$.
[Panels: rows n = 1000, 10000, 100000; columns $\theta$ = 1, 10, 100.]
4.3 Distribution of Cluster Sizes under PY prediction rule

Figure 3 is a plot of $\bar{C}_{j,n}$ as a function of j, where $\bar{C}_{j,n}$ is
now calculated as the average over the 1000 simulated independent partitions of
$C_{j,n}$ under the Pitman-Yor prediction rule with $a = 0.5$. Points are again
plotted on a log-log scale and the line in each plot from the upper left corner to
the lower right corner is the asymptotic relationship $g(j)$ given in (12). We again
observe the same divergence from the asymptotic relationship due to the finite sample
size constraint, which is more substantial in our simulations with n = 1000 compared
to n = 10000 or n = 100000.



Figure 3: Pitman-Yor ($a = 0.5$) prediction rule: $\bar{C}_{j,n}$ as a function of j
for different values of $\theta$ and n. Both axes are plotted on the log scale. The
red line indicates the relationship $g(j)$ suggested in (12).
[Panels: rows n = 1000, 10000, 100000; columns $\theta$ = 1, 10, 100.]
4.4 Distribution of Cluster Sizes under UN prediction rule

Figure 4 is a plot of $\bar{C}_{j,n}$ as a function of j, where $\bar{C}_{j,n}$ is
now calculated as the average over the 1000 simulated independent partitions of
$C_{j,n}$ under the uniform prediction rule. As in Figures 2 and 3, the points in
Figure 4 are plotted on the log-log scale, and the line along the top of each plot is
the relationship $f(j) = \theta$. Based on Figure 4, we conjecture that
$\lim_{n \to \infty} E(C_{j,n} \mid UN) = \theta$, though clearly we again see a
deviation due to the finite sample size constraint. We again observe that the
deviation from the conjectured relationship is most substantial in the upper-right
corner, when n is smaller (n = 1000) and $\theta$ is larger ($\theta = 100$).

Figure 4: Uniform prediction rule: $\bar{C}_{j,n}$ as a function of j for different
values of $\theta$ and n. Both axes are plotted on the log scale. The red line
indicates the line $h(j) = \theta$.

We draw attention to several conclusions from our simulation results. The first is
the interesting similarities and differences between the Uniform and Pitman-Yor
prediction rules. When $a = 0.5$, both the Uniform and Pitman-Yor processes show
the same rate of growth for the expected number of clusters $E(K_n)$. However, the
two processes are quite different in terms of their resulting distributions of
cluster sizes $C_{j,n}$, as can be seen in Figures 3 and 4. In addition, from
examining Figures 2-4, we note that there is substantial deviation from the
asymptotic relationship for larger cluster sizes, especially in smaller samples
(e.g. n = 1000). This deviation is due to the finite sample constraint (14), which is
not acknowledged by the asymptotic relationships presented in Section 3.



5 Exchangeability of Prediction Rules 

When using a sequential prediction rule framework for generating prior
distributions over partitions, one also needs to address the issue of
exchangeability. A partition is exchangeable if the joint prior density of the
partition is unaffected by the order in which the clusters were generated. As pointed
out by Pitman (2002), most sequential prediction rules will fail to produce a
partition that is exchangeable. Consider a partition of n observations into K
clusters with sizes $N = (N_1, \ldots, N_K)$, which are listed in the same order in
which they were created. Pitman (2002) states that the partition generated by a
particular prediction rule is exchangeable if and only if the joint density $p(N)$ is
a symmetric function of these cluster sizes $N = (N_1, \ldots, N_K)$. The Dirichlet
process prediction rule (2) gives the joint density,

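$$p(N) = \frac{\theta^{K} \prod_{k=1}^{K} (N_k - 1)!}{\prod_{i=1}^{n} (\theta + i - 1)}, \tag{15}$$

while the Pitman-Yor process prediction rule (5) gives the joint density,

$$p(N) = \frac{\prod_{k=1}^{K-1} (\theta + k a) \prod_{k=1}^{K} \prod_{i=1}^{N_k - 1} (i - a)}{\prod_{i=1}^{n-1} (\theta + i)}. \tag{16}$$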

Both of these joint densities (15) and (16) are symmetric functions of the cluster
sizes $(N_1, \ldots, N_K)$, which implies that the Dirichlet process and Pitman-Yor
process prediction rules both directly lead to exchangeable partitions. However, the
joint density from the Uniform prediction rule (4) is

,(N) . ^Pi±il) (17) 

i=l 

which is not a symmetric function of $(N_1, \ldots, N_K)$ in the denominator. This
means that we will get different values of the prior density (17) for different, but
exchangeable, orderings of unequally-sized clusters. However, this lack of
exchangeability for the uniform process can be addressed by defining a "signature" of
the partition that is identical for exchangeable partitions (Green and Richardson,
2001). For example, if we let $p^*(N) = k \cdot p(N')$ where $N'$ is $N$ with the
$N_i$'s arranged in order from the largest cluster to the smallest, then the
calculation of (17) for $N'$ will be the same for all exchangeable values of $N$.
Comparisons of partitions generated by the uniform process should be performed on the
signatures of these partitions instead of the partitions themselves.
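
A three-observation example illustrates the lack of exchangeability. The sketch below multiplies the conditional probabilities of rule (4) along a given sequence of cluster labels; it is a direct evaluation of the prediction rule, not an implementation of (17), and both function names are hypothetical. Labels are assumed to be integers assigned in order of first appearance (0, 1, 2, ...).

```python
def uniform_sequence_prob(labels, theta):
    """Probability of a label sequence under the uniform prediction rule (4)."""
    prob = 1.0
    seen = 0  # number of clusters created so far
    for lab in labels:
        if lab == seen:  # observation opens a new cluster
            prob *= theta / (seen + theta)
            seen += 1
        elif 0 <= lab < seen:  # observation joins an existing cluster
            prob *= 1.0 / (seen + theta)
        else:
            raise ValueError("labels must appear in order of first appearance")
    return prob


def signature(sizes):
    """Cluster sizes sorted from largest to smallest."""
    return tuple(sorted(sizes, reverse=True))


p_a = uniform_sequence_prob([0, 0, 1], theta=1.0)  # sizes (2, 1) in creation order
p_b = uniform_sequence_prob([0, 1, 1], theta=1.0)  # sizes (1, 2) in creation order
```

With $\theta = 1$ the two sequences have probabilities 1/4 and 1/6 respectively, although both partitions consist of one cluster of size 2 and one singleton and therefore share the signature (2, 1).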
6 Discussion

We have explored and compared the characteristics of three prior distributions for
partitions of random variables. The popular Dirichlet process prior has a "rich get
richer" property that leads to partitions with a small number of relatively large
clusters. This property is not usually acknowledged by practitioners when using
the Dirichlet process within a hierarchical model. An important consequence of
the "rich get richer" property is that the number of clusters grows slowly with
an increasing number of observations. In fact, the expected number of clusters
grows logarithmically as the sample size increases to infinity. We have presented
two other priors for random partitions as alternatives to the Dirichlet process. The
Pitman-Yor process prior includes an additional parameter $0 < a < 1$ that helps
to dampen this "rich get richer" property. This parameter $a$ controls the growth
of the number of clusters: the expected number of clusters grows at a rate of $n^a$
as the sample size goes to infinity. However, we observed that the distribution of
cluster sizes under the Pitman-Yor process still shows similar characteristics to the
Dirichlet process.

As a more substantial alternative to the Dirichlet process, we introduced a uniform
process prior that eliminates the "rich get richer" property completely. We
presented a novel result that shows that the expected number of clusters under
this uniform process grows with the square root of n. Our simulation studies also
demonstrated a dramatic difference in the distribution of cluster sizes between the
uniform and Dirichlet processes. In many applied settings, these differences in prior
assumptions may not be influential on posterior inference (e.g. Jensen and Liu
(2007)). However, as demonstrated by Green and Richardson (2001), the "rich get
richer" property of the Dirichlet process prior can persist in the posterior
distribution for many datasets. When the data do not contain strong clustering
relationships, the uniform prior distribution will be less likely to group variables
into large clusters, which is a more conservative tendency in terms of inferring
clustering relationships. We suggest that the extra cautiousness of the uniform
process could be advantageous in situations where a partition must be inferred over
a large number of variables based on relatively noisy data. We hope that the results
derived in this paper can provide some guidance as to the appropriate prior choice
for future non-parametric Bayesian clustering applications.
A Proof of Theorem 3.1

We first define $T_k = \inf\{m > T_{k-1} : X_m \notin \{X_1, \ldots, X_{m-1}\}\}$.
$T_k$ is the "waiting time" (number of observations needed) until the $k$-th unique
observation. Under the Uniform prediction rule, $T_k = \sum_{i=1}^{k} \tau_i$ where
$\tau_i \sim \text{Geometric}(\theta/(\theta + i - 1))$ and the $\tau_i$'s are
independent, so

$$E(T_k) = \sum_{i=1}^{k} \frac{\theta + i - 1}{\theta} = \frac{k^2}{2\theta} + \left(1 - \frac{1}{2\theta}\right) k.$$

In terms of $T_j$,

$$K_n = \max\{j : T_j \le n\} = \sum_{j=1}^{\infty} I(T_j \le n).$$
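
The moments of $T_k$ used in this appendix can be checked numerically. The sketch below sums the means and variances of independent geometric waiting times (counting trials up to and including the first success); the function name is ours. It illustrates that $E(T_k)$ grows like $k^2/(2\theta)$ while $\mathrm{Var}(T_k)$ grows like $k^3$, which is the scaling the Chebyshev argument relies on.

```python
def waiting_time_moments(k, theta):
    """Mean and variance of T_k = tau_1 + ... + tau_k under the uniform rule."""
    mean = 0.0
    var = 0.0
    for i in range(1, k + 1):
        p = theta / (theta + i - 1)  # success probability of tau_i
        mean += 1.0 / p              # E(tau_i) for a geometric on {1, 2, ...}
        var += (1.0 - p) / p ** 2    # Var(tau_i)
    return mean, var


mean, var = waiting_time_moments(1000, 1.0)
```

For $\theta = 1$ and k = 1000, `var / 1000**3` is close to 1/3, so $\mathrm{Var}(T_k)/k^4$ decays like $1/k$.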

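We first prove a strong law for the convergence of $T_k$. Let $\epsilon > 0$. From
Chebyshev's inequality, and since $\mathrm{Var}(T_k) = \sum_{i=1}^{k} \mathrm{Var}(\tau_i)$
is of order $k^3$,

$$P\left(|T_k - E(T_k)| > \epsilon k^2\right) \le \frac{\mathrm{Var}(T_k)}{\epsilon^2 k^4} \le \frac{C(\theta, \epsilon)}{k}. \tag{18}$$

From (18), we have that

$$P\left(|T_{k^2} - E(T_{k^2})| > \epsilon k^4\right) \le \frac{C(\theta, \epsilon)}{k^2},$$

and so by the Borel-Cantelli lemma,
$P\left(|T_{k^2} - E(T_{k^2})| > \epsilon k^4 \text{ infinitely often}\right) = 0$.
Since $\epsilon > 0$ was chosen arbitrarily, it follows that
$(T_{k^2} - E(T_{k^2}))/k^4 \to 0$ almost surely and hence
$T_{k^2}/k^4 \to 1/(2\theta)$ almost surely. Now, let $m = \lfloor \sqrt{k} \rfloor$.
Since $T_k$ is increasing,

$$\frac{T_{m^2}}{(m + 1)^4} \le \frac{T_k}{k^2} \le \frac{T_{(m+1)^2}}{m^4}. \tag{19}$$

Since $(m + 1)/m \to 1$, both sides of the inequality (19) converge to
$(2\theta)^{-1}$ almost surely, and so

$$\frac{T_k}{k^2} \to \frac{1}{2\theta} \quad \text{almost surely}. \tag{20}$$

The strong law (20) easily implies a strong law for $K_n$ as follows. Note that
$T_{K_n} \le n < T_{K_n + 1}$ and, consequently,

$$\frac{T_{K_n}}{K_n^2} \le \frac{n}{K_n^2} < \frac{T_{K_n + 1}}{K_n^2}.$$

Since $K_n \to \infty$ almost surely and $T_k/k^2 \to 1/(2\theta)$ almost surely, it
follows that the left and right hand sides above both converge to $1/(2\theta)$
almost surely. Thus, $K_n^2/n \to 2\theta$ almost surely and so

$$\frac{K_n}{\sqrt{n}} \to \sqrt{2\theta} \quad \text{almost surely}. \tag{21}$$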

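From the strong law (21) and the dominated convergence theorem, we have

$$\frac{E(K_n)}{n} \to 0. \tag{22}$$

We combine (22) together with the following result,

$$E(K_n^2) = E(K_n) + 2\theta\,(n - E(K_n)), \tag{23}$$

to give us

$$\frac{E(K_n^2)}{n} \to 2\theta. \tag{24}$$

We prove result (23) in Appendix B. Finally, using (24) together with Fatou's
lemma and Jensen's inequality gives us

$$\sqrt{2\theta} \le \liminf_{n \to \infty} \frac{E(K_n)}{\sqrt{n}} \le \limsup_{n \to \infty} \frac{E(K_n)}{\sqrt{n}} \le \limsup_{n \to \infty} \sqrt{\frac{E(K_n^2)}{n}} = \sqrt{2\theta},$$

which proves the result

$$\lim_{n \to \infty} \frac{E(K_n)}{\sqrt{n}} = \sqrt{2\theta}$$

under the Uniform prediction rule.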



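B Result relating $E(K_n)$ to $E(K_n^2)$

Recall the definition of $T_j$ from Appendix A and now define $M_n = K_n + 1$.
Consider the "waiting time" $T_{M_n}$ until the observation that creates our
$(K_n + 1)$-th unique cluster. We relate $E(K_n)$ to $E(K_n^2)$ by calculating
$E(T_{M_n})$ in two different ways. First, observe that

$$\begin{aligned}
E(T_{M_n}) &= E\left(\sum_{k=1}^{\infty} \tau_k \cdot I(k \le M_n)\right) = \sum_{k=1}^{\infty} E(\tau_k) \cdot P(k \le M_n) \\
&= \frac{\theta - 1}{\theta} \sum_{k=1}^{\infty} P(k \le M_n) + \frac{1}{\theta} \sum_{k=1}^{\infty} k \cdot P(k \le M_n) \\
&= \frac{\theta - 1}{\theta} E(M_n) + \frac{1}{2\theta} E\big(M_n (M_n + 1)\big),
\end{aligned}$$

which, since $M_n = K_n + 1$, simplifies to

$$E(T_{M_n}) = 1 + E(K_n)\left(1 + \frac{1}{2\theta}\right) + \frac{E(K_n^2)}{2\theta}. \tag{25}$$

Now, observe that $T_{M_n} = n + \sum_{j \ge 0} I(M_{n+j} = M_n)$ and so
$E(T_{M_n}) = n + \sum_{j \ge 0} P(M_{n+j} = M_n)$ where

$$\begin{aligned}
P(M_{n+j} = M_n) &= \sum_{k} P(T_k \le n < T_{k+1})\, P(j < \tau_{k+1}) \\
&= \sum_{k} P(M_n = k + 1)\, P(j < \tau_{k+1}).
\end{aligned}$$

It follows that

$$\begin{aligned}
E(T_{M_n}) &= n + \sum_{j} \sum_{k} P(M_n = k + 1)\, P(j < \tau_{k+1}) \\
&= n + \sum_{k} P(M_n = k + 1) \sum_{j} P(j < \tau_{k+1}) \\
&= n + \sum_{k} P(M_n = k + 1)\, E(\tau_{k+1}) \\
&= n + \sum_{k} P(K_n = k)\, \frac{\theta + k}{\theta},
\end{aligned}$$

which can be simplified to

$$E(T_{M_n}) = n + 1 + \frac{E(K_n)}{\theta}. \tag{26}$$

Combining (25) and (26) gives the result (23).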
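
Result (23) can also be checked exactly for small n, since the distribution of $K_n$ under the uniform rule follows from a simple forward recursion on the number of clusters. The sketch below is an independent numerical check with hypothetical function names; it is not part of the proof.

```python
def k_distribution_uniform(n, theta):
    """Return p with p[k] = Pr(K_n = k) under the uniform rule (p[0] = 0)."""
    p = [0.0, 1.0]  # after one observation, K_1 = 1 with probability 1
    for _ in range(n - 1):
        q = [0.0] * (len(p) + 1)
        for k in range(1, len(p)):
            new = theta / (k + theta)   # probability of opening a new cluster
            q[k] += p[k] * (1.0 - new)  # stay at k clusters
            q[k + 1] += p[k] * new      # move to k + 1 clusters
        p = q
    return p


def identity_sides(n, theta):
    """Return (E(K_n^2), E(K_n) + 2*theta*(n - E(K_n))), the two sides of (23)."""
    p = k_distribution_uniform(n, theta)
    ek = sum(k * pk for k, pk in enumerate(p))
    ek2 = sum(k * k * pk for k, pk in enumerate(p))
    return ek2, ek + 2.0 * theta * (n - ek)
```

For example, with n = 2 and $\theta = 1$ both sides equal 5/2.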

References 

Arratia, R., Barbour, A., and Tavaré, S. (2003). Logarithmic Combinatorial Structures:
a Probabilistic Approach. Monographs in Mathematics. European Mathematical
Society, Zurich, Switzerland.

Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks.
Science 286, 509-512.

Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference
using mixtures. Journal of the American Statistical Association 90, 577-588.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. Annals
of Statistics 1, 209-230.

Green, P. and Richardson, S. (2001). Modelling heterogeneity with and without the
Dirichlet process. Scandinavian Journal of Statistics 28, 355-375.

Jensen, S. and Liu, J. (2007). Bayesian clustering of transcription factor binding
motifs. Journal of the American Statistical Association Forthcoming.

Liu, J. (1996). Nonparametric hierarchical Bayes via sequential imputations. Annals
of Statistics 24, 911-930.

Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infinite mixture models
based clustering of gene expression profiles. Bioinformatics 18, 1194-1206.

Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian data analysis.
Statistical Science 19, 95-110.

Pitman, J. (2002). Combinatorial stochastic processes. Tech. Rep. 621, Department
of Statistics, University of California at Berkeley.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution
derived from a stable subordinator. Annals of Probability 25, 855-900.

Qin, Z. S., McCue, L. A., Thompson, W., Mayerhofer, L., Lawrence, C. E., and Liu,
J. S. (2003). Identification of co-regulated genes through Bayesian clustering of
predicted regulatory binding sites. Nature Biotechnology 21, 435-439.

Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor
processes. In ACL 2006.

Zabell, S. L. (1992). Predicting the unpredictable. Synthese 90, 205-232.






